Phrasal Rank-Encoding: Exploiting Phrase Redundancy and Translational Relations for Phrase Table Compression
Abstract
We describe Phrasal Rank-Encoding (PR-Enc), a novel method for compressing word-aligned target language data in phrase tables as used in phrase-based SMT. The method reduces the redundancy in phrase tables that is a direct effect of the phrase-based approach. Combining PR-Enc with Huffman coding reduces the size of an already aggressively compressed phrase table by a further 39 percent. Using this and other space-reduction methods in a new binary phrase table implementation, we achieve a size reduction of an order of magnitude compared to the Moses on-disk phrase table implementation. In terms of decoding speed, all variants of the new phrase table are faster than the Moses binary phrase table implementation, and the PR-Enc encoded variant outperforms all other methods.
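To make the interplay of rank encoding and Huffman coding concrete, here is a minimal toy sketch. It is not the authors' implementation and it does not model the phrasal variant (PR-Enc). It assumes one plausible reading of word-level rank encoding, namely that each target word is replaced by its rank among the translations of the source word it is aligned to. The lexicon, the function names, and the example data are all hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

# Hypothetical toy lexicon: translations per source word, most probable first.
LEXICON = {
    "das": ["the", "that", "this"],
    "haus": ["house", "home", "building"],
    "klein": ["small", "little"],
}

def rank_encode(aligned_pairs):
    """Replace each target word by its rank among the translations
    of the source word it is aligned to (assumed encoding scheme)."""
    return [LEXICON[src].index(tgt) for src, tgt in aligned_pairs]

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code
    built from the symbol counts of the given sequence."""
    freqs = Counter(symbols)
    if len(freqs) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freqs)): 1}
    tiebreak = count()  # unique counter so heap never compares the dicts
    heap = [(f, next(tiebreak), {s: 0}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def encoded_bits(symbols):
    """Total number of bits needed to Huffman-code the sequence
    (ignoring the cost of storing the code table itself)."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)

# Word-aligned target data as (source word, target word) pairs.
pairs = [
    ("das", "the"), ("haus", "house"), ("klein", "small"),
    ("das", "the"), ("haus", "home"), ("klein", "little"),
]
words = [tgt for _, tgt in pairs]
ranks = rank_encode(pairs)

print("bits for plain target words:", encoded_bits(words))
print("bits for rank symbols:      ", encoded_bits(ranks))
```

In this toy data the rank symbols are drawn from a much smaller and more skewed alphabet than the target words, which is what lets the Huffman code spend fewer bits. A decoder would need the same lexicon and the alignment to recover the words from the ranks.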
Similar resources
A Phrase Table without Phrases: Rank Encoding for Better Phrase Table Compression
This paper describes the first steps towards a minimum-size phrase table implementation to be used for phrase-based statistical machine translation. The focus lies on the size reduction of target language data in a phrase table. Rank Encoding (REnc), a novel method for the compression of word-aligned target language in phrase tables is presented. Combined with Huffman coding a relative size red...
Augmenting Phrase Table by Employing Lexicons for Pivot-based SMT
Pivot language is employed as a way to solve the data sparseness problem in machine translation, especially when the data for a particular language pair does not exist. The combination of source-to-pivot and pivot-to-target translation models can induce a new translation model through the pivot language. However, the errors in two models may compound as noise, and still, the combined model may ...
Hallucinating Phrase Translations for Low Resource MT
We demonstrate that “hallucinating” phrasal translations can significantly improve the quality of machine translation in low resource conditions. Our hallucinated phrase tables consist of entries composed from multiple unigram translations drawn from the baseline phrase table and from translations that are induced from monolingual corpora. The hallucinated phrase table is very noisy. Its transl...
Mandarin Chinese Prosodic Phrase Grouping and Modeling—Method and Implications
One major feature of the prosody of Mandarin Chinese speech flow is prosodic phrase grouping [1, 2, 3]. Phrasal and sentential intonations are governed by a prosody framework that structurally groups phrases into a larger/longer and identifiable unit. An overall prosody pattern of such phrase grouping with prosodic specifications is superimposed on the phrase group. In turn, individual phrasal i...
Phrasal complexity and the occurrence of filled pauses in presentation speeches in Japanese
Filled pauses are ubiquitous in everyday speech. I investigated whether linguistic complexity of upcoming phrases affects filler rate at phrase boundaries in presentation speeches in Japanese. Filler rate at phrase boundaries increased monotonically with complexity of the following phrases. However, when the following phrase was composed of more than 11 Bunsetsu-phrases, the filler rate did not...
Journal: Prague Bull. Math. Linguistics
Volume: 98
Publication date: 2012